Abstract

We propose computationally efficient encoders and decoders for lossy compression using
a Sparse Regression Code. The codebook is defined by a design matrix, and codewords are
structured linear combinations of columns of this matrix. The proposed encoding algorithm
sequentially chooses columns of the design matrix to successively approximate the source
sequence. It is shown to achieve the optimal distortion-rate function for i.i.d. Gaussian sources
under the squared-error distortion criterion. For a given rate, the parameters of the design
matrix can be varied to trade off distortion performance with encoding complexity. An example of
such a trade-off is: computational resource (space or time) per source sample of $O((n/\log n)^2)$
and probability of excess distortion decaying exponentially in $n/\log n$, where $n$ is the block
length. The Sparse Regression Code is robust in the following sense: for any ergodic source, the
proposed encoder achieves the optimal distortion-rate function of an i.i.d. Gaussian source with
the same variance. Simulations show that the encoder has very good empirical performance,
especially at low and moderate rates.



1 Introduction 

Developing efficient codes for lossy compression at rates approaching the Shannon rate-distortion
limit has long been one of the important goals of information theory. Efficiency is measured in
terms of the storage complexity of the codebook as well as the computational complexity of encoding
and decoding. The Shannon-style i.i.d. random codebook [1] has optimal performance in terms of
the trade-off between distortion and rate as well as the error exponent¹ [2], [3]. However, both the
storage and computational complexity of this codebook grow exponentially with the block length.

In this paper, we study a class of codes called Sparse Superposition or Sparse Regression Codes
(SPARC) for lossy compression with the squared-error distortion criterion. We present
computationally efficient encoding and decoding algorithms that provably attain the optimal
rate-distortion function for i.i.d. Gaussian sources.

* This work was partially supported by NSF Grants CCF-1017744 and CCF-1217023.

1 The error exponent of a compression code measures how fast the probability of excess distortion decays to zero
with growing block length.



The Sparse Regression codebook is constructed based on the statistical framework of
high-dimensional linear regression, and was proposed recently by Barron and Joseph for communication
over the AWGN channel at rates approaching the channel capacity [4]-[6]. The codewords are sparse
linear combinations of columns of an $n \times N$ design matrix or 'dictionary', where $n$ is the
block length and $N$ is a low-order polynomial in $n$. This structure enables the design of
computationally efficient encoders based on the rich theory of sparse approximation (e.g., [7], [8]).
We propose one such encoding algorithm and analyze its performance.

SPARCs for lossy compression were first considered in [9], where some preliminary results were
presented. The rate-distortion and error exponent performance of these codes under
minimum-distance (optimal) encoding are characterized in [10], [11]. The main contributions of this
paper are the following.

• We propose a computationally efficient encoding algorithm for SPARCs which achieves the
optimal distortion-rate function for i.i.d. Gaussian sources with growing block length $n$. The
algorithm is based on successive approximation of the source sequence by columns of the
design matrix. The parameters of the design matrix can be chosen to trade off performance
with complexity. For example, one choice of parameters discussed in Section 4 yields an
$n \times O(n^2)$ design matrix, per-sample encoding complexity proportional to $(n/\log n)^2$,
and probability of excess distortion decaying exponentially in $n/\log n$. To the best of our
knowledge, this is the fastest rate of decay among lossy compression codes with computationally
feasible encoding and decoding.

• With this encoding algorithm, SPARCs share the robustness property of random i.i.d. Gaussian
codebooks [12]-[14]. Robustness refers to the property of the code that, for a given rate $R$,
any ergodic source with variance $\sigma^2$ can be compressed with distortion close to the optimal
i.i.d. Gaussian distortion-rate function $\sigma^2 e^{-2R}$.



The proposed encoding algorithm may be interpreted in terms of successive refinement [15], [16].
Letting $L = \Theta(n/\log n)$, one may interpret the algorithm as successively refining the source
over $L$ stages, with rate $R/L$ in each stage. In other words, by successively refining the source
over an asymptotically large number ($L$) of stages with asymptotically small rate ($R/L$) in
each stage, we attain the optimal Gaussian distortion-rate function with polynomial encoding
complexity ($L^2$) and probability of excess distortion falling exponentially in $L$.

This successive refinement interpretation (discussed in Remark 7 in Section 4) may be of
interest beyond the context of SPARCs, and could be used to develop computationally efficient
lossy compression algorithms for general sources and distortion measures.



The results of this paper together with those in [5], [6] show that Sparse Regression codes with
computationally efficient encoders and decoders can be used for both source and channel coding at
rates approaching the Shannon-theoretic limits. Further, the source and channel coding SPARCs
can be nested to effect binning and superposition [17], which are essential ingredients of coding
schemes for a large number of multi-terminal source and channel coding problems. Thus SPARCs
can be used to build computationally efficient, rate-optimal codes for a variety of problems in
network information theory.

We briefly review related work in developing computationally efficient codes for lossy
compression. Gupta, Verdu and Weissman [18] showed that the optimal rate-distortion function of
memoryless sources can be approached by concatenating optimal codes over sub-blocks of length
much smaller than the overall block length. Nearest neighbor encoding is used over each of these
sub-blocks, which is computationally feasible due to their short length. For this scheme, it is not
known how rapidly the probability of excess distortion decays to zero with the overall block length;
the decay may be slow if the sub-blocks are chosen to be very short in order to keep the encoding
complexity low. For sources with finite alphabet, various coding techniques have been proposed
recently to approach the rate-distortion bound with computationally feasible encoding and
decoding [19]-[23]. The rates of decay of the probability of excess distortion for these schemes
vary, but in general they are slower than exponential in the block length.

The survey paper by Gray and Neuhoff [24] contains an extensive discussion of various
compression techniques and their performance versus complexity trade-offs. These include scalar
quantization with entropy coding, tree-structured vector quantization, multi-stage vector
quantization, and trellis-coded quantization. Though these techniques have good empirical
performance, they have not been shown to attain the optimal rate-distortion trade-off with
computationally feasible encoders and decoders. For an overview and comparison of these compression
techniques, the reader is referred to [24, Section V]. We remark that many of these schemes also use
successive approximation ideas to reduce encoding complexity. Lattice-based codes for lossy
compression [25]-[27] have a compact representation, i.e., low storage complexity. There are
computationally efficient quantizers for certain classes of lattice codes, but the high-dimensional
lattices needed to approach the rate-distortion bound have exponential encoding complexity [27].



The paper is organized as follows. Section 2 describes the construction of the sparse regression
codebook. In Section 3, we describe the encoding algorithm, followed by a heuristic explanation
of why it attains the Gaussian distortion-rate limit. Section 4 contains the main result of the
paper, a characterization of the compression performance of SPARCs with the proposed encoding
algorithm. Various remarks are also made regarding the performance-complexity trade-off, the gap
from the optimal distortion-rate limit, and the successive refinement interpretation. Section 4.1
contains simulation results illustrating the distortion-rate performance. The proof of the main
result is given in Section 5, and Section 6 concludes the paper.

Notation: Upper-case letters are used to denote random variables, lower-case for their
realizations, and bold-face letters to denote random vectors and matrices. $\mathcal{N}(\mu, \sigma^2)$
is used to denote the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. All vectors have
length $n$. The source sequence is denoted by $S = (S_1, \ldots, S_n)$, and the reconstruction
sequence by $\hat{S} = (\hat{S}_1, \ldots, \hat{S}_n)$. $\|X\|$ denotes the $\ell_2$-norm of vector
$X$, and $|X| = \|X\|/\sqrt{n}$ is the normalized version. $\langle a, b \rangle = \sum_i a_i b_i$
denotes the Euclidean inner product between vectors $a$ and $b$. All logarithms are with base $e$
unless otherwise mentioned. $f(x) = o(g(x))$ means $\lim_{x \to \infty} f(x)/g(x) = 0$;
$f(x) = \Theta(g(x))$ means $f(x)/g(x)$ asymptotically lies in an interval $[\kappa_1, \kappa_2]$
for some constants $\kappa_1, \kappa_2 > 0$.

[Figure 1 graphic: the design matrix $A$ drawn as $L$ sections (Section 1, Section 2, ..., Section L)
of $M$ columns each, above the vector $\beta$ whose entries are zero except for the values
$c_1, c_2, \ldots$, one per section.]

Figure 1: $A$ is an $n \times ML$ matrix and $\beta$ is an $ML \times 1$ binary vector. The positions of
the non-zeros in $\beta$ correspond to the gray columns of $A$ which combine to form the codeword $A\beta$.

2 The Sparse Regression Codebook 

A sparse regression code (SPARC) is defined in terms of a design matrix $A$ of dimension $n \times ML$
whose entries are i.i.d. $\mathcal{N}(0, 1)$, i.e., independent zero-mean Gaussian random variables
with unit variance. Here $n$ is the block length and $M$ and $L$ are integers whose values will be
specified shortly in terms of $n$ and the rate $R$. As shown in Figure 1, one can think of the matrix
$A$ as composed of $L$ sections with $M$ columns each. Each codeword is a linear combination of $L$
columns, with one column from each section. Formally, a codeword can be expressed as $A\beta$, where
$\beta$ is a binary-valued $ML \times 1$ vector $(\beta_1, \ldots, \beta_{ML})$ with the following
property: there is exactly one non-zero $\beta_i$ for $1 \le i \le M$, one non-zero $\beta_i$ for
$M + 1 \le i \le 2M$, and so forth. The non-zero value of $\beta$ in section $i$ is set to $c_i$,
where the value of $c_i$ will be specified in the next section. Denote the set of all $\beta$'s that
satisfy this property by $\mathcal{B}_{M,L}$.

Since there are $M$ columns in each of the $L$ sections, the total number of codewords is $M^L$. To
obtain a compression rate of $R$ nats/sample, we therefore need

$$M^L = e^{nR} \quad \text{or} \quad L \log M = nR. \tag{1}$$



Encoder: This is defined by a mapping $g : \mathbb{R}^n \to \mathcal{B}_{M,L}$. Given the source
sequence $S$ and target distortion $D$, the encoder attempts to find a $\beta \in \mathcal{B}_{M,L}$
such that

$$|S - A\beta|^2 \le D.$$

If such a codeword is not found, an error is declared. In the next section, we present a
computationally efficient encoding algorithm and characterize its performance in Section 4.

Decoder: This is a mapping $h : \mathcal{B}_{M,L} \to \mathbb{R}^n$. On receiving
$\beta \in \mathcal{B}_{M,L}$ from the encoder, the decoder produces reconstruction² $h(\beta) = A\beta$.

Storage Complexity: The storage complexity of the dictionary is proportional to $nML$. There
are several choices for the pair $(M, L)$ which satisfy (1). For example, $L = 1$ and $M = e^{nR}$
recovers the Shannon-style random codebook in which the number of columns in the dictionary $A$ is
$e^{nR}$, i.e., the storage complexity is exponential in $n$. For our constructions in Section 4.1
we will choose $M$ to be a low-order polynomial in $n$. This implies that $L$ is
$\Theta(n/\log n)$, and the number of columns $ML$ in the dictionary is a low-order polynomial in
$n$. This reduction in storage complexity can be harnessed to develop computationally efficient
encoders for the sparse regression code. We emphasize that the results presented here hold for any
choice of $(M, L)$ satisfying (1); the choice of $(M, L)$ above offers a good trade-off between
complexity and error performance.

2 In the remainder of the paper, we will often refer to $\beta$ as the codeword though, strictly
speaking, $A\beta$ is the actual codeword.
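To make the storage comparison concrete, the following minimal sketch evaluates the rate constraint L log M = nR for one polynomial choice of M and for the L = 1 case. The function names and the values L = 50, b = 2, R = 1 nat are illustrative assumptions, not parameters taken from the paper.

```python
import math

def sparc_dimensions(L: int, b: float, rate_nats: float):
    """Return (n, M, number of columns) for M = L**b with L*log(M) = n*R."""
    M = round(L ** b)
    n = L * math.log(M) / rate_nats   # block length implied by the rate constraint
    return n, M, M * L

def shannon_columns(n: float, rate_nats: float) -> float:
    """Number of columns when L = 1, i.e. M = exp(n*R)."""
    return math.exp(n * rate_nats)

n, M, cols = sparc_dimensions(L=50, b=2, rate_nats=1.0)
print(f"SPARC: n ~ {n:.0f}, M = {M}, ML = {cols} columns")
print(f"L = 1 codebook at the same n: {shannon_columns(n, 1.0):.3e} columns")
```

Both codebooks have the same number of codewords; only the number of stored columns differs.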

3 Computationally Efficient Encoding Algorithm 

The source sequence $S$ is generated by an ergodic source with mean 0 and variance $\sigma^2$.

The SPARC is defined by the $n \times ML$ design matrix $A$. The $j$th column of $A$ is denoted $A_j$,
$1 \le j \le ML$. The non-zero values of $\beta$, $\{c_i\}_{i=1}^{L}$, are chosen to be

$$c_i = \sqrt{\frac{2R\sigma^2}{L}\left(1 - \frac{2R}{L}\right)^{i-1}}, \quad i = 1, \ldots, L. \tag{2}$$

Given source sequence $S$, the encoder determines $\beta \in \mathcal{B}_{M,L}$ according to the
following algorithm.

Step 0: Set $R_0 = S$.
Step $i$, $i = 1, \ldots, L$: Pick

$$m_i = \underset{j:\ (i-1)M < j \le iM}{\operatorname{argmax}} \left\langle \frac{R_{i-1}}{\|R_{i-1}\|}, A_j \right\rangle. \tag{3}$$

Set

$$R_i = R_{i-1} - c_i A_{m_i}, \tag{4}$$

where $c_i$ is given by (2).

Step $L + 1$: The codeword $\beta$ has non-zero values in positions $m_i$, $1 \le i \le L$. The value
of the non-zero in section $i$ is given by $c_i$.

The algorithm chooses the $m_i$'s in a greedy manner, section by section, to minimize a 'residue'
in each step. In Section 3.2, we give a non-rigorous explanation of why the algorithm succeeds
(with high probability) in finding a codeword within distortion $D$ of a typical source sequence, for
rates $R$ larger than $R^*(D)$. The formal performance analysis is contained in Sections 4 and 5.
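The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: it assumes the coefficients take the form c_i = sqrt((2Rσ²/L)(1 − 2R/L)^(i−1)), uses 0-based column indices, and runs on small, arbitrarily chosen parameters (L = 16, M = 64, R = 0.5 nats) that are far from the asymptotic regime analyzed in the paper.

```python
import math
import random

def sparc_encode(S, A, M, L, R, sigma2):
    """Greedy SPARC encoder: returns the chosen column indices and the final residue.

    S is the source sequence (n floats); A is a list of M*L columns of n floats each.
    """
    residue = list(S)                                   # Step 0: R_0 = S
    chosen = []
    for i in range(1, L + 1):
        c_i = math.sqrt(2 * R * sigma2 / L * (1 - 2 * R / L) ** (i - 1))
        section = range((i - 1) * M, i * M)             # 0-based columns of section i
        # Dividing by ||R_{i-1}|| does not change the argmax, so it is skipped.
        m_i = max(section, key=lambda j: sum(r * a for r, a in zip(residue, A[j])))
        residue = [r - c_i * a for r, a in zip(residue, A[m_i])]
        chosen.append(m_i)
    return chosen, residue

# Small illustrative run with arbitrary parameters.
rng = random.Random(0)
L, M, R, sigma2 = 16, 64, 0.5, 1.0
n = round(L * math.log(M) / R)                          # from L log M = nR
A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(M * L)]
S = [rng.gauss(0, math.sqrt(sigma2)) for _ in range(n)]
chosen, residue = sparc_encode(S, A, M, L, R, sigma2)
distortion = sum(r * r for r in residue) / n
print("empirical distortion:", distortion)
print("Gaussian D*(R):      ", sigma2 * math.exp(-2 * R))
```

With such a small M, the empirical distortion should be expected to sit noticeably above the Gaussian limit; the gap is the subject of Section 4.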

3.1 Computational Complexity 

The encoding algorithm consists of $L$ stages, where each stage involves computing $M$ inner products
followed by finding the maximum among them. Therefore the number of operations per source
sample is proportional to $ML$. If we choose $M = L^b$ for some $b > 0$, (1) implies
$L = \Theta(n/\log n)$, and the number of operations per source sample is of the order
$(n/\log n)^{b+1}$. We also note that due to the sequential nature of the algorithm, only one
section of the design matrix ($M$ columns) needs to be kept in memory at each step.

When we have several source sequences to be encoded in succession, the encoder can have the
following pipelined architecture. There are $L$ modules: the first module computes the inner product
of the source sequence with each column in the first section of $A$ and determines the maximum;
the second module computes the inner product of the first-step residual vector with each column
in the second section of $A$, and so on. Each module has $M$ parallel units; each unit consists of
a multiplier and an accumulator to compute an inner product in a pipelined fashion. After an
initial delay of $L$ source sequences, all the modules work simultaneously. This encoder architecture
requires computational space (memory) of the order $nLM$ and has constant computation time per
source symbol.

The code structure automatically yields low decoding complexity. The encoder can represent
the chosen $\beta$ with $L$ binary sequences of $\log_2 M$ bits each. The $i$th binary sequence
indicates the position of the non-zero element in section $i$. Thus the decoder complexity involved
in locating the $L$ non-zero elements using the received bits is $L \log_2 M$. Reconstructing the
codeword then involves $L$ additions per source sample.
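The index representation and the decoder described above can be sketched as follows. The helper names, the 0-based indexing, and the tiny hand-made matrix are assumptions made for illustration only.

```python
def indices_to_bits(chosen, M):
    """Encode the within-section position of each chosen column with ceil(log2 M) bits."""
    width = max(1, (M - 1).bit_length())
    return [format(m % M, f"0{width}b") for m in chosen]

def bits_to_indices(bit_strings, M):
    """Recover 0-based global column indices: section i contributes i*M + position."""
    return [i * M + int(bits, 2) for i, bits in enumerate(bit_strings)]

def sparc_decode(chosen, A, c):
    """Reconstruct A*beta by adding the L selected columns, scaled by c_i."""
    n = len(A[0])
    s_hat = [0.0] * n
    for c_i, m_i in zip(c, chosen):
        for k in range(n):
            s_hat[k] += c_i * A[m_i][k]     # L additions per source sample overall
    return s_hat

# Toy example: M = 4 columns per section, L = 2 sections, n = 3.
M = 4
A = [[float(j), float(j + 1), float(j + 2)] for j in range(8)]
bits = indices_to_bits([2, 5], M)
recovered = bits_to_indices(bits, M)
s_hat = sparc_decode(recovered, A, c=[1.0, 0.5])
print(bits, recovered, s_hat)
```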

3.2 Heuristic derivation of the algorithm 

In this section, we present a non-rigorous analysis of the proposed encoding algorithm based on the 
following observations. 

1. For $1 \le j \le ML$, $|A_j|^2$ is approximately equal to 1 when $n$ is large. This is due to the
law of large numbers, since each $|A_j|^2$ is the normalized sum of squares of $n$ i.i.d.
$\mathcal{N}(0, 1)$ random variables.

2. Similarly, $|S|^2$ is approximately equal to $\sigma^2$ for large $n$ due to the law of large numbers.

3. If $X_1, X_2, \ldots, X_M$ are i.i.d. $\mathcal{N}(0, 1)$ random variables, then
$\max\{X_1, \ldots, X_M\}$ is approximately equal to $\sqrt{2 \log M}$ for large $M$ [28].

A precise analysis of the deviations of these quantities from their typical values above is given in
Section 5.
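The third observation is easy to probe numerically. The sketch below (function name, seed and trial counts are arbitrary choices) compares the sample average of the maximum of M standard Gaussians with sqrt(2 log M). For moderate M the average typically sits somewhat below sqrt(2 log M), which hints at why the deviations need the careful treatment given later.

```python
import math
import random

def average_max_of_gaussians(M: int, trials: int, seed: int = 0) -> float:
    """Average, over independent trials, of the maximum of M i.i.d. N(0,1) samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(M))
    return total / trials

for M in (10, 100, 1000, 10000):
    avg = average_max_of_gaussians(M, trials=50)
    print(M, round(avg, 3), round(math.sqrt(2 * math.log(M)), 3))
```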

Step 1: Consider the statistic

$$T_j^{(1)} = \left\langle \frac{R_0}{\|R_0\|}, A_j \right\rangle, \quad 1 \le j \le M. \tag{5}$$

For each $j$, $T_j^{(1)}$ is a $\mathcal{N}(0, 1)$ random variable since it is the projection of the
i.i.d. $\mathcal{N}(0, 1)$ random vector $A_j$ along the direction of $R_0$, and $R_0$ is independent
of $A_j$. Further, the $T_j^{(1)}$'s are mutually independent for $1 \le j \le M$. This can be seen by
conditioning on the realization of $R_0$. Indeed, conditioned on

$$R_0/\|R_0\| = r = (r_1, \ldots, r_n),$$

we have for $j \ne k$

$$\begin{aligned}
T_j^{(1)} &= \langle r, A_j \rangle = r_1 A_{j1} + r_2 A_{j2} + \ldots + r_n A_{jn}, \\
T_k^{(1)} &= \langle r, A_k \rangle = r_1 A_{k1} + r_2 A_{k2} + \ldots + r_n A_{kn}.
\end{aligned} \tag{6}$$

$\{A_{j1}, \ldots, A_{jn}, A_{k1}, \ldots, A_{kn}\}$ are i.i.d. $\mathcal{N}(0, 1)$ and
$r_1, \ldots, r_n$ are constants such that $\sum_i r_i^2 = 1$. Hence, conditioned on $R_0$,
$T_j^{(1)}$ and $T_k^{(1)}$ are independent $\mathcal{N}(0, 1)$ random variables. Since this
holds for every choice of $R_0$, $T_j^{(1)}$ and $T_k^{(1)}$ are independent.
We therefore have

$$\max_{1 \le j \le M} T_j^{(1)} = \left\langle \frac{R_0}{\|R_0\|}, A_{m_1} \right\rangle \approx \sqrt{2 \log M}. \tag{7}$$

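From (4), the normalized norm of the residue $R_1$ can be expressed as

$$\begin{aligned}
|R_1|^2 &= |R_0|^2 + c_1^2 |A_{m_1}|^2 - \frac{2c_1}{n} \langle A_{m_1}, R_0 \rangle \\
&= |R_0|^2 + c_1^2 |A_{m_1}|^2 - \frac{2c_1 \|R_0\|}{n} \left\langle A_{m_1}, \frac{R_0}{\|R_0\|} \right\rangle \\
&\overset{(a)}{\approx} |R_0|^2 + c_1^2 - \frac{2c_1 \|R_0\|}{n} \sqrt{2 \log M} \\
&\overset{(b)}{\approx} \sigma^2 + c_1^2 - \frac{2c_1 \sigma}{\sqrt{n}} \sqrt{2 \log M} \\
&\overset{(c)}{=} \sigma^2 \left(1 - \frac{2R}{L}\right).
\end{aligned} \tag{8}$$

In the chain above, (a) and (b) follow from (7) and the three observations listed at the beginning of
this subsection; (c) holds by substituting for $c_1$ from (2) and for $n$ from (1).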
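Step $i$, $i = 2, \ldots, L$: We show that if

$$|R_{i-1}|^2 \approx \sigma^2 \left(1 - \frac{2R}{L}\right)^{i-1},$$

then

$$|R_i|^2 \approx \sigma^2 \left(1 - \frac{2R}{L}\right)^{i}. \tag{9}$$

We already showed that (9) is true for $i = 1$.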

For each $j \in \{(i-1)M + 1, \ldots, iM\}$, the statistic

$$T_j^{(i)} = \left\langle \frac{R_{i-1}}{\|R_{i-1}\|}, A_j \right\rangle \tag{10}$$

is a $\mathcal{N}(0, 1)$ random variable since it is the projection of the i.i.d. $\mathcal{N}(0, 1)$
random vector $A_j$ along the direction of $R_{i-1}$, and $R_{i-1}$ is independent of $A_j$. The latter
is true because $R_{i-1}$ is a function of the source sequence $S$ and the columns $\{A_j\}$,
$1 \le j \le (i-1)M$, which are all independent of $A_j$ for $j \in \{(i-1)M + 1, \ldots, iM\}$.
Further, the $T_j^{(i)}$'s are mutually independent for $j \in \{(i-1)M + 1, \ldots, iM\}$. This is
seen by conditioning on the realization of $R_{i-1}/\|R_{i-1}\|$ and using arguments identical to those
for Step 1 ((6) and the discussion surrounding it).
We therefore have

$$\max_{(i-1)M+1 \le j \le iM} T_j^{(i)} = \left\langle \frac{R_{i-1}}{\|R_{i-1}\|}, A_{m_i} \right\rangle \approx \sqrt{2 \log M}. \tag{11}$$

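From (4), we have

$$\begin{aligned}
|R_i|^2 &= |R_{i-1}|^2 + c_i^2 |A_{m_i}|^2 - \frac{2c_i \|R_{i-1}\|}{n} \left\langle A_{m_i}, \frac{R_{i-1}}{\|R_{i-1}\|} \right\rangle \\
&\overset{(a)}{\approx} |R_{i-1}|^2 + c_i^2 - \frac{2c_i \|R_{i-1}\|}{n} \sqrt{2 \log M} \\
&\overset{(b)}{\approx} \sigma^2 \left(1 - \frac{2R}{L}\right)^{i-1} + c_i^2 - 2c_i \sigma \sqrt{\frac{2 \log M}{n}} \left(1 - \frac{2R}{L}\right)^{(i-1)/2} \\
&\overset{(c)}{=} \sigma^2 \left(1 - \frac{2R}{L}\right)^{i}.
\end{aligned} \tag{12}$$

As before, (a) and (b) follow from (11) and the three observations listed at the beginning of this
subsection; (c) holds by substituting for $c_i$ from (2) and for $n$ from (1). It can be verified
that the chosen value of $c_i$ minimizes the third line of (12).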
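The minimization claim can be checked with elementary calculus. The sketch below assumes the third line of (12) has the form $\rho_{i-1}^2 + c^2 - 2c\,\rho_{i-1}\sqrt{2\log M/n}$, where the shorthand $\rho_{i-1} = \sigma(1 - 2R/L)^{(i-1)/2}$ is introduced here for convenience, and uses $2\log M/n = 2R/L$ from the rate constraint $L \log M = nR$.

```latex
\begin{align*}
f(c) &= \rho_{i-1}^2 + c^2 - 2c\,\rho_{i-1}\sqrt{\tfrac{2\log M}{n}}
      = \rho_{i-1}^2 + c^2 - 2c\,\rho_{i-1}\sqrt{\tfrac{2R}{L}}, \\
f'(c) &= 2c - 2\rho_{i-1}\sqrt{\tfrac{2R}{L}} = 0
  \;\Longrightarrow\; c = \rho_{i-1}\sqrt{\tfrac{2R}{L}}
  = \sqrt{\tfrac{2R\sigma^2}{L}\Bigl(1-\tfrac{2R}{L}\Bigr)^{i-1}}, \\
f\Bigl(\rho_{i-1}\sqrt{\tfrac{2R}{L}}\Bigr) &= \rho_{i-1}^2\Bigl(1-\tfrac{2R}{L}\Bigr)
  = \sigma^2\Bigl(1-\tfrac{2R}{L}\Bigr)^{i}.
\end{align*}
```

Since $f$ is a convex quadratic in $c$, the stationary point is the global minimizer, and the minimum value matches the typical residue after step $i$.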



Therefore, the residue when the algorithm terminates after Step $L$ is

$$|R_L|^2 = |S - A\beta|^2 \approx \sigma^2 \left(1 - \frac{2R}{L}\right)^{L} \le \sigma^2 e^{-2R}, \tag{13}$$

where we have used the inequality $(1 + x) \le e^x$ for all $x \in \mathbb{R}$.

Thus the encoding algorithm picks a codeword $\beta$ that yields squared-error distortion
approximately equal to $\sigma^2 e^{-2R}$, the Gaussian distortion-rate function at rate $R$. Making
the arguments above rigorous involves bounding the deviation of the residual distortion at each stage
from its typical value.
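As a quick numerical illustration of the last step, the sketch below evaluates the heuristic final residue σ²(1 − 2R/L)^L for increasing L and compares it with σ²e^(−2R). The values R = 1 nat and σ² = 1 are arbitrary choices.

```python
import math

def heuristic_distortion(sigma2: float, R: float, L: int) -> float:
    """Typical final residue sigma^2 * (1 - 2R/L)^L from the heuristic recursion."""
    return sigma2 * (1 - 2 * R / L) ** L

R, sigma2 = 1.0, 1.0
target = sigma2 * math.exp(-2 * R)
values = []
for L in (4, 16, 64, 256, 1024):
    d = heuristic_distortion(sigma2, R, L)
    values.append(d)
    print(L, d, target - d)
```

The printed differences are positive and shrink as L grows, consistent with (1 + x) ≤ e^x being nearly tight for small |x|.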

4 Main Result 

Theorem 1. Consider a length $n$ source sequence $S$, generated by an ergodic source having mean 0
and variance $\sigma^2$. Let $\delta_0, \delta_1, \delta_2$ be any positive constants such that

$$\Delta \triangleq \delta_0 + 5R(\delta_1 + \delta_2) < \frac{1}{2}. \tag{14}$$

Let $A$ be an $n \times ML$ design matrix with i.i.d. $\mathcal{N}(0, 1)$ entries and $M, L$
satisfying (1). On the SPARC defined by $A$, the proposed encoding algorithm produces a codeword
$A\beta$ that satisfies the following for sufficiently large $M, L$.

$$P\left(|S - A\beta|^2 > \sigma^2 e^{-2R} + 2\sigma^2 e^{-R} \Delta + \sigma^2 \Delta^2\right) < p_0 + p_1 + p_2, \tag{15}$$

where

$$p_0 = P\left(\left|\frac{|S|}{\sigma} - 1\right| > \delta_0\right), \qquad
p_1 = 2ML \cdot \exp\left(-n\delta_1^2/8\right), \qquad
p_2 = \left(\frac{M^{2\delta_2}}{8 \log M}\right)^{-L}. \tag{16}$$
Corollary 1. If the source sequence $S$ is generated according to an i.i.d. Gaussian distribution
$\mathcal{N}(0, \sigma^2)$, then

$$p_0 < 2 \exp(-3n\delta_0^2/4),$$

and the SPARC with the proposed encoder attains the optimal distortion-rate function with
probability of excess distortion decaying exponentially in $L$.

Proof. For an i.i.d. Gaussian source, $\|S\|^2$ is the sum of the squares of $n$ i.i.d.
$\mathcal{N}(0, \sigma^2)$ random variables. The bound on $p_0$ is obtained using a Chernoff bound on
the probability of the events $\{\|S\|^2 > n\sigma^2(1 + \delta_0)\}$ and
$\{\|S\|^2 < n\sigma^2(1 - \delta_0)\}$. The second part of the statement follows by observing that
the distortion-rate function of an i.i.d. $\mathcal{N}(0, \sigma^2)$ source at rate $R$ is
$\sigma^2 e^{-2R}$, and that for any fixed $\delta_0, \delta_1, \delta_2 > 0$, $p_0, p_1, p_2$ can be
made arbitrarily small for large enough $M, L$. □
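To get a feel for the size of the terms in the bound, the sketch below evaluates them in the log domain. It assumes the readings p1 = 2ML·exp(−nδ1²/8), p2 = (M^(2δ2)/(8 log M))^(−L) and Δ = δ0 + 5R(δ1 + δ2) for the quantities in Theorem 1; the function names and the sample parameters (L = 100, M = L³, R = 1 nat) are illustrative assumptions.

```python
import math

def log_p1(n: float, M: int, L: int, delta1: float) -> float:
    """Natural log of p1 = 2*M*L*exp(-n*delta1^2/8)."""
    return math.log(2 * M * L) - n * delta1 ** 2 / 8

def log_p2(M: int, L: int, delta2: float) -> float:
    """Natural log of p2 = (M^(2*delta2) / (8*log M))^(-L)."""
    return -L * (2 * delta2 * math.log(M) - math.log(8 * math.log(M)))

def excess_threshold(sigma2, R, delta0, delta1, delta2) -> float:
    """Distortion level sigma^2 * (e^-R + Delta)^2 inside the probability bound."""
    Delta = delta0 + 5 * R * (delta1 + delta2)
    return sigma2 * (math.exp(-R) + Delta) ** 2

L, b, R = 100, 3, 1.0
M = L ** b
n = L * math.log(M) / R
for d2 in (0.1, 0.2, 0.3):
    print("delta2 =", d2, " log p2 =", round(log_p2(M, L, d2), 1))
print("log p1 at delta1 = 0.02:", round(log_p1(n, M, L, 0.02), 2))
```

A positive log value means the corresponding term exceeds 1 and the bound is vacuous at those parameters, which is consistent with the constants in the theorem not being optimized.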

Remarks:

1. The probability measure in (15) is over the space of source sequences and design matrices.
The codeword $\beta$ is a deterministic function of the source sequence $S$ and design matrix $A$.

2. Ergodicity of the source ensures that $p_0 \to 0$ as $n \to \infty$ at a rate depending only on the
source statistics.

3. For a Gaussian source $\mathcal{N}(0, \sigma^2)$, Corollary 1 says that we can achieve a distortion
within any constant gap of the optimal distortion-rate function $D^*(R) = \sigma^2 e^{-2R}$ with a
probability of excess distortion that falls exponentially in $L$. Recall that if we choose $M = L^b$
for $b > 0$, $L = \Theta(n/\log n)$. In this case, for any fixed $\Delta > 0$ the probability that
the distortion exceeds $D^*(R)$ by more than $\Delta \sigma^2 (2e^{-R} + \Delta)$ falls exponentially
in $n/\log n$.

4. For a given rate $R$, Theorem 1 guarantees that the proposed encoder achieves a squared-error
distortion close to the Gaussian $D^*(R)$ for all ergodic sources with variance $\sigma^2$. This
complements results along the same lines by Sakrison and Lapidoth [12]-[14] for Gaussian
random codebooks (i.i.d. codewords) with minimum-distance encoding. In fact, Lapidoth [12]
also shows that for any ergodic source of a given variance, one cannot attain a squared-error
distortion smaller than the Gaussian $D^*(R)$ using a Gaussian random codebook.

5. Gap from $D^*(R)$: To achieve distortions close to the Gaussian $D^*(R)$ with high probability,
we need $p_0, p_1, p_2$ to all go to 0. In particular, for $p_2 \to 0$ with growing $L$, from (16) we
require that $M^{2\delta_2} > 8 \log M$. Or,

$$\delta_2 > \frac{\log \log M}{2 \log M} + \frac{\log 8}{2 \log M}. \tag{17}$$

To approach $D^*(R)$, note that we need $n, L, M$ to all go to $\infty$ while satisfying (1): $n, L$
for the probability of error in (16) to be small, and $M$ in order to allow $\delta_2$ to be small
according to (17). When $n, L, M$ are sufficiently large, (17) dictates how small $\Delta$ can be:
the distortion is approximately of order $\frac{\log \log M}{\log M}$ higher than the optimal value
$D^*(R) = \sigma^2 e^{-2R}$.



6. Performance versus Complexity Trade-off: Recall that the computational complexity of the
encoding algorithm is $O(ML)$ operations per source sample. As $M, L$ get larger, this complexity
increases but the performance of the algorithm also improves, in terms of the gap from
the optimal distortion (17) as well as the probability of error (16). Let us consider a few
illustrative cases.

• Choosing $M = L^b$ for some $b > 0$ yields (from (1)) $L = \Theta(n/\log n)$ and
$M = \Theta((n/\log n)^b)$. In terms of block length $n$, the computational complexity is
$\Theta((n/\log n)^{b+1})$ and the gap from $D^*(R)$ governed by (17) is approximately of order
$\frac{\log \log n}{\log n}$. For our simulations described in the next sub-section, we choose
$b = 3$ and $L \in [50, 100]$.

• If we choose $M \sim \kappa \log n$ for $\kappa > 0$, this leads to
$L \sim \frac{nR}{\log \log n}$. The computational complexity is
$\Theta\left(\frac{n \log n}{\log \log n}\right)$, lower than the previous case; however, the gap
$\delta_2$ from (17) is approximately $\frac{\log \log \log n}{\log \log n}$, i.e., the convergence
to $D^*(R)$ with $n$ is much slower.

• At the other extreme, consider the Shannon codebook with $L = 1$, $M = e^{nR}$. In this
case, the SPARC consists of only one section and the proposed algorithm is essentially
maximum-likelihood encoding. The computational complexity is $O(e^{nR})$ (exponential),
while the gap $\delta_2$ from (17) is approximately of order $\frac{\log n}{n}$. The gap $\Delta$
from $D^*(R)$ is now dominated by $\delta_0$ and $\delta_1$, which are $\Theta(1/\sqrt{n})$,
consistent with the results in [29]-[31].

The storage complexity of the SPARC is proportional to the number of entries in the design
matrix, given by $nML$.

7. Successive Refinement Interpretation: The proposed encoding algorithm may be interpreted
in terms of successive refinement source coding [15], [16]. We can think of each section of the
design matrix $A$ as a lossy codebook of rate $R/L$. For each Section $i$, $i = 1, \ldots, L$, the
residue $R_{i-1}$ acts as the 'source' sequence, and the algorithm attempts to find the column
codeword within the section that minimizes the distortion. The distortion after Section $i$ is the
variance of the residue $R_i$; this residue acts as the 'source' sequence for Section $i + 1$. Recall
that the minimum mean-squared distortion with a Gaussian codebook [12] at rate $R/L$ is

$$D_i^* = |R_{i-1}|^2 \exp(-2R/L) \approx |R_{i-1}|^2 \left(1 - \frac{2R}{L}\right) \quad \text{for } R/L \ll 1. \tag{18}$$

The typical value of the distortion in Section $i$ is close to $D_i^*$ since the algorithm is
equivalent to maximum-likelihood encoding within each section. (See Section 3.2, (12) in particular.)
However, since the rate $R/L$ is infinitesimal, the deviations from $D_i^*$ in each section can be
quite large. In fact, the probability of excess distortion within each section falls polynomially
in $n$. Despite this, when the number of sections $L$ is large, the final distortion $|R_L|^2$ is
close to the typical value $\sigma^2 e^{-2R}$ with excess distortion probability that falls
exponentially in $L$.

We emphasize that the successive refinement interpretation is only true for the proposed
encoder, and is not an inherent feature of the sparse regression codebook. In particular, an
important direction for future work is to design encoding algorithms with faster convergence
to $D^*(R)$ while still having complexity that is polynomial in $n$.

3 For $L = 1$, the factor $ML$ that multiplies the exponential term in $p_1$ can be eliminated via a
sharper analysis.
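The three regimes discussed in Remark 6 can be compared numerically. The sketch below assumes the threshold in (17) reads δ2 > (log log M + log 8)/(2 log M), works with log M to avoid overflow, and uses hypothetical helper names and sample values (n = 10⁴, R = 1 nat, b = 3, κ = 1).

```python
import math

def min_delta2(log_M: float) -> float:
    """Smallest delta2 permitted by the threshold (17), as a function of log M."""
    return (math.log(log_M) + math.log(8)) / (2 * log_M)

def solve_L(n: float, R: float, b: float) -> float:
    """Solve b * L * log(L) = n * R for L > 1 by bisection (the case M = L^b)."""
    lo, hi = 1.0, 2.0
    while b * hi * math.log(hi) < n * R:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if b * mid * math.log(mid) < n * R:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def compare_regimes(n: float, R: float, b: float = 3.0, kappa: float = 1.0):
    """Map each regime to (log of ML, smallest delta2 from (17))."""
    rows = {}
    L_poly = solve_L(n, R, b)
    log_M = b * math.log(L_poly)
    rows["M = L^b"] = (log_M + math.log(L_poly), min_delta2(log_M))
    log_M = math.log(kappa * math.log(n))          # requires kappa * log n > 1
    rows["M = kappa log n"] = (log_M + math.log(n * R / log_M), min_delta2(log_M))
    log_M = n * R                                  # L = 1, M = exp(nR)
    rows["L = 1"] = (log_M, min_delta2(log_M))
    return rows

for name, (log_ops, d2) in compare_regimes(1e4, 1.0).items():
    print(f"{name:16s} log(ML) = {log_ops:10.2f}   min delta2 = {d2:.4f}")
```

The ordering of the two columns is opposite across regimes: fewer operations per sample go with a larger admissible gap.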



4.1 Simulation Results 

In this section, we study the performance of the encoder via simulations on source sequences of
unit variance. A brief remark before we proceed. For the simulations, we use a slightly modified
version of the algorithm presented in Section 3. In each step $i$, we replace (3) with

$$m_i = \underset{j:\ (i-1)M < j \le iM}{\operatorname{argmin}} \|R_{i-1} - c_i A_j\|^2
= \underset{j:\ (i-1)M < j \le iM}{\operatorname{argmax}} \left( 2c_i \langle R_{i-1}, A_j \rangle - c_i^2 \|A_j\|^2 \right). \tag{19}$$

When $n$ is large, this is almost the same as the original version in (3), since the normalized
norm $|A_j|^2 \approx 1$ for all $j$. We found the modified version to have slightly better empirical
performance at high rates, but the original algorithm in Section 3 is more amenable to theoretical
analysis.
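The equivalence between the two forms of the modified rule can be checked directly. In the sketch below the function names and the random test data are illustrative assumptions; the third function is the original correlation rule, included for comparison.

```python
import random

def inner(u, v):
    return sum(x * y for x, y in zip(u, v))

def pick_by_distance(residue, columns, c):
    """argmin over j of ||R - c*A_j||^2."""
    return min(range(len(columns)),
               key=lambda j: sum((r - c * a) ** 2 for r, a in zip(residue, columns[j])))

def pick_by_score(residue, columns, c):
    """argmax over j of 2c<R, A_j> - c^2 ||A_j||^2."""
    return max(range(len(columns)),
               key=lambda j: 2 * c * inner(residue, columns[j])
                             - c * c * inner(columns[j], columns[j]))

def pick_by_correlation(residue, columns):
    """argmax over j of <R, A_j>: the original rule, which ignores the norm term."""
    return max(range(len(columns)), key=lambda j: inner(residue, columns[j]))

rng = random.Random(1)
n, M, c = 200, 50, 0.3
residue = [rng.gauss(0, 1) for _ in range(n)]
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(M)]
print(pick_by_distance(residue, columns, c),
      pick_by_score(residue, columns, c),
      pick_by_correlation(residue, columns))
```

The first two indices always agree, because expanding the squared distance leaves ||R||² as a term common to all columns. The third may differ when column norms vary.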

The top graph in Figure 2 shows the performance of the proposed encoder on a unit variance
i.i.d. Gaussian source. The dictionary dimension is $n \times ML$ with $M = L^b$. The curves show the
average distortion at various rates for $b = 2$ and $b = 3$. The average was obtained from 70 random
trials at each rate. To keep up with convention, we plot rates in bits rather than nats. The value of
$L$ was increased with rate in order to keep the total computational complexity
($\propto nL^{b+1}$) similar across different rates. Recall from (1) that the block length is
determined by

$$n = \frac{bL \log L}{R}.$$

For example, for the rates 1.08, 2.09, 3.10 and 4.11 bits/sample, $L$ was chosen to be 46, 66, 81 and
97, respectively. The corresponding values for the block length are $n = 705, 573, 497, 468$ for
$b = 3$, and $n = 470, 382, 331, 312$ for $b = 2$. The graph shows the reduction in distortion
obtained by increasing $b$ from 2 to 3. This reduction comes at the expense of an increase in
computational complexity by a factor of $L$. We also ran simulations for a unit variance Laplacian
source. The resulting distortion-rate curve was virtually identical to Figure 2, which is consistent
with Theorem 1.
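The block lengths quoted above can be checked against the relation n = bL log L / R, with R converted from bits to nats. The sketch below is a plain arithmetic check. Because the quoted rates are rounded to two decimals, the computed values should only be expected to agree with the quoted block lengths to within about one sample.

```python
import math

def block_length(L: int, b: int, rate_bits: float) -> float:
    """n = b * L * log(L) / R, with the rate converted from bits to nats."""
    rate_nats = rate_bits * math.log(2)
    return b * L * math.log(L) / rate_nats

quoted = {  # (rate in bits, L): (quoted n for b = 3, quoted n for b = 2)
    (1.08, 46): (705, 470),
    (2.09, 66): (573, 382),
    (3.10, 81): (497, 331),
    (4.11, 97): (468, 312),
}
for (rate_bits, L), (n3, n2) in quoted.items():
    print(rate_bits, L,
          round(block_length(L, 3, rate_bits), 1), n3,
          round(block_length(L, 2, rate_bits), 1), n2)
```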

Gish and Pierce [32] showed that uniform quantizers with entropy coding are nearly optimal at
high rates and that their distortion for a unit variance source is well-approximated by
$\frac{\pi e}{6} e^{-2R}$. ($R$ is the entropy of the quantizer in nats.) The bottom graph of
Figure 2 zooms in on the higher rates and shows the above high-rate approximation for the distortion
of an optimal entropy-coded scalar quantizer (EC-SQ). Recall from (17) that the distortion gap from
$D^*(R)$ is of the order of

$$\delta_2 \approx \frac{\log \log M}{2 \log M} = \frac{\log b + \log \log L}{2b \log L},$$

which is comparable to the optimal $D^*(R) = e^{-2R}$ in the high-rate region. (In fact, $\delta_2$ is
larger than $D^*(R)$ at rates greater than 3 bits for the values of $L$ and $b$ we have used.) This
explains the large ratio of the empirical distortion to $D^*(R)$ at higher rates.

[Figure 2 graphic: two plots of distortion against Rate (bits/sample); the bottom plot covers rates
from 2 to 4.5 bits/sample.]

Figure 2: Top: Average distortion of SPARC with the proposed encoder for unit variance Gaussian source.
The design matrix has dimension $n \times ML$ with $M = L^b$. The distortion-rate performance is shown
for $b = 2$ and $b = 3$ along with $D^*(R)$, the Shannon distortion-rate function for an i.i.d.
Gaussian source. Bottom: Focusing on the higher rates. The dashed line is the high-rate approximation
for the distortion-rate function of an optimal entropy-coded scalar quantizer.

In summary, the proposed encoder has good empirical performance, especially at low to moderate
rates, even with modest values of $L$ and $b$. At high rates, EC-SQs have nearly optimal distortion
performance. However, the high-rate approximation for EC-SQ assumes optimal entropy coding,
which may not be feasible in practice for reasons of complexity. [11] shows that with
minimum-distance encoding, SPARCs attain $D^*(R)$ with the optimal error exponent. This suggests an
interesting direction for future work: how do we design encoding algorithms with lower distortion
at high rates, while still being computationally feasible?



5 Proof of Theorem 1

The essence of the proof is analyzing the deviation from the typical values of the residual distortion
at each step of the encoding algorithm. In particular, we have to deal with atypicality concerning
the source, the design matrix and the maximum computed in each step of the algorithm.

First, we introduce some notation to capture the deviations from the typical values. The
normalized Euclidean norm of the source is expressed as

$$|S|^2 = |R_0|^2 = \sigma^2 (1 + \Delta_0)^2. \tag{20}$$

The norm of the residue at stage $i = 1, \ldots, L$ is given by

$$|R_i|^2 = \sigma^2 \left(1 - \frac{2R}{L}\right)^{i} (1 + \Delta_i)^2. \tag{21}$$

$\Delta_i \in [-1, \infty)$ measures the deviation of the residual distortion $|R_i|^2$ from its
typical value given in (9).

Next we express the norm of $A_{m_i}$, the column of $A$ chosen in step $i$, as

$$|A_{m_i}|^2 = 1 + \gamma_i, \quad i = 1, \ldots, L. \tag{22}$$

Recall that the statistics $T_j^{(i)}$ defined in (10) are i.i.d. $\mathcal{N}(0, 1)$ random variables
for $j \in \{(i-1)M + 1, \ldots, iM\}$. We write

$$\max_{(i-1)M+1 \le j \le iM} T_j^{(i)} = \left\langle \frac{R_{i-1}}{\|R_{i-1}\|}, A_{m_i} \right\rangle
= \sqrt{2 \log M}\,(1 + \epsilon_i), \quad i = 1, \ldots, L. \tag{23}$$

Thus $\epsilon_i$ measures the deviation of the maximum computed in step $i$ from $\sqrt{2 \log M}$.

4 The constants in Theorem 1 are not optimized, so the theorem does not give a very precise estimate
of the excess distortion in the high-rate, low-distortion regime.
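Using this notation, we have

$$\begin{aligned}
|R_1|^2 &= |R_0|^2 + c_1^2 |A_{m_1}|^2 - \frac{2 c_1 \|R_0\|}{n} \left\langle A_{m_1}, \frac{R_0}{\|R_0\|} \right\rangle \\
&= \sigma^2 (1+\Delta_0)^2 + c_1^2 (1+\gamma_1) - 2 c_1 \sigma (1+\Delta_0) \sqrt{\frac{2\log M}{n}}\,(1+\epsilon_1) \\
&= \sigma^2 (1+\Delta_0)^2 + \frac{2R\sigma^2 (1+\gamma_1)}{L} - \frac{4R\sigma^2 (1+\Delta_0)(1+\epsilon_1)}{L} \\
&= \sigma^2 \left(1 - \frac{2R}{L}\right) \left[ (1+\Delta_0)^2 + \frac{2R/L}{1-2R/L} \left( \Delta_0^2 + \gamma_1 - 2\epsilon_1 (1+\Delta_0) \right) \right].
\end{aligned} \tag{24}$$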

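From (24) and (21), we obtain

$$(1+\Delta_1)^2 = (1+\Delta_0)^2 + \frac{2R/L}{1-2R/L} \left( \Delta_0^2 + \gamma_1 - 2\epsilon_1 (1+\Delta_0) \right). \tag{25}$$

Using steps very similar to the above, it can be verified that the deviations of the residue in steps $i$ and $i-1$ are related as

$$(1+\Delta_i)^2 = (1+\Delta_{i-1})^2 + \frac{2R/L}{1-2R/L} \left( \Delta_{i-1}^2 + \gamma_i - 2\epsilon_i (1+\Delta_{i-1}) \right), \quad i = 1, \ldots, L. \tag{26}$$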
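As a numerical sanity check of the recursion (26), the following sketch runs a successive-approximation encoder on a small random instance and measures how closely the deviations satisfy (26) at each step. It is an illustration under stated assumptions, not the implementation used for the simulations reported in the paper: the function name `check_recursion` and the toy parameters are hypothetical, the rate is taken to be $R = L \log M / n$ nats, and the section coefficients are taken to be $c_i^2 = (2R\sigma^2/L)(1-2R/L)^{i-1}$, which is the choice implied by (24) and (26).

```python
import math
import random


def check_recursion(n=48, M=8, L=12, sigma=1.0, seed=0):
    """Run a toy successive-approximation encoder and return the largest
    absolute mismatch between the two sides of recursion (26)."""
    rng = random.Random(seed)
    R = L * math.log(M) / n            # assumed rate definition (nats per sample)
    rho = 2.0 * R / L                  # the quantity 2R/L appearing in (21)-(26)
    # Design matrix stored column-wise: M*L columns, each of length n.
    cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M * L)]
    residue = [rng.gauss(0.0, sigma) for _ in range(n)]   # R_0 = S

    def norm_sq(x):                    # normalized squared norm |x|^2 = ||x||^2 / n
        return sum(v * v for v in x) / n

    delta_prev = math.sqrt(norm_sq(residue)) / sigma - 1.0   # Delta_0 from (20)
    worst = 0.0
    for i in range(1, L + 1):
        c_i = sigma * math.sqrt(rho * (1.0 - rho) ** (i - 1))  # assumed coefficient
        r_len = math.sqrt(sum(v * v for v in residue))
        section = cols[(i - 1) * M: i * M]
        # Statistics T_j = <A_j, R_{i-1} / ||R_{i-1}||> for the columns of section i.
        stats = [sum(a * r for a, r in zip(col, residue)) / r_len for col in section]
        best = max(range(M), key=lambda k: stats[k])
        gamma = norm_sq(section[best]) - 1.0                      # from (22)
        eps = stats[best] / math.sqrt(2.0 * math.log(M)) - 1.0    # from (23)
        residue = [r - c_i * a for r, a in zip(residue, section[best])]
        # Delta_i from (21).
        delta = math.sqrt(norm_sq(residue) / (sigma ** 2 * (1.0 - rho) ** i)) - 1.0
        rhs = (1.0 + delta_prev) ** 2 + rho / (1.0 - rho) * (
            delta_prev ** 2 + gamma - 2.0 * eps * (1.0 + delta_prev))
        worst = max(worst, abs((1.0 + delta) ** 2 - rhs))
        delta_prev = delta
    return worst


if __name__ == "__main__":
    print(check_recursion())   # mismatch should be at floating-point rounding level
```

Since (26) is an algebraic identity under these assumptions, any mismatch beyond rounding error would indicate an inconsistency between the coefficient choice and the definitions (20)-(23).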



The goal is to bound the final distortion, which is given by

$$|R_L|^2 = \sigma^2 \left(1 - \frac{2R}{L}\right)^{L} (1+\Delta_L)^2. \tag{27}$$

We would therefore like to find an upper bound for $(1+\Delta_L)^2$ that holds under an event with high probability. Accordingly, define $\mathcal{A}$ as the event where all of the following hold:

1. $|\Delta_0| < \delta_0$,
2. $\frac{1}{L}\sum_{i=1}^{L} |\gamma_i| < \delta_1$,
3. $\frac{1}{L}\sum_{i=1}^{L} |\epsilon_i| < \delta_2$,

for $\delta_0, \delta_1, \delta_2$ as specified in the statement of the Theorem. We upper bound the probability of the event $\mathcal{A}^c$ using the following lemmas.

Lemma 5.1. For $\delta \in (0, 1]$, $P\left( \frac{1}{L}\sum_{i=1}^{L} |\gamma_i| > \delta \right) < 2ML \exp(-n\delta^2/8)$.

Proof. In Appendix A. □

Lemma 5.2. For $\delta > 0$, $P\left( \frac{1}{L}\sum_{i=1}^{L} |\epsilon_i| > \delta \right) < \left( \frac{M^{2\delta}}{8 \log M} \right)^{-L}$.

Proof. In Appendix B. □

Using these lemmas, we have

$$P(\mathcal{A}^c) < p_0 + p_1 + p_2 \tag{28}$$

where $p_0, p_1, p_2$ are given by (16). The remainder of the proof consists of obtaining a bound for $(1+\Delta_L)^2$ under the condition that $\mathcal{A}$ holds. We start with the following lemma.
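Lemma 5.3. For all sufficiently large $L$, when $\mathcal{A}$ holds we have

$$\Delta_i \ge \Delta_0 - \frac{4R/L}{1-2R/L} \sum_{j=1}^{i} \left( |\gamma_j| + |\epsilon_j| \right), \quad i = 1, \ldots, L. \tag{29}$$

In particular, $\Delta_i > -\frac{1}{2}$ for $i = 1, \ldots, L$.

Proof. We first show that $\Delta_i > -\frac{1}{2}$ follows from (29). Indeed, (29) implies that

$$\Delta_i \ge \Delta_0 - \frac{4R}{1-2R/L} \left( \frac{\sum_{j=1}^{L} \left( |\gamma_j| + |\epsilon_j| \right)}{L} \right) \overset{(a)}{>} -\delta_0 - \frac{4R(\delta_1 + \delta_2)}{1-2R/L} \overset{(b)}{>} -\frac{1}{2} \tag{30}$$

where (a) is obtained from the conditions of $\mathcal{A}$ while (b) holds due to (14).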



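We now prove (29) by induction. The statement trivially holds for $i = 0$. Towards induction, assume (29) holds for $i-1$ for some $i \in \{1, \ldots, L\}$. From (26), we obtain

$$\begin{aligned}
(1+\Delta_i)^2 &\ge (1+\Delta_{i-1})^2 + \frac{2R/L}{1-2R/L} \left( \gamma_i - 2\epsilon_i (1+\Delta_{i-1}) \right) \\
&\ge (1+\Delta_{i-1})^2 - \frac{2R/L}{1-2R/L} \left( |\gamma_i| + 2|\epsilon_i| (1+\Delta_{i-1}) \right).
\end{aligned} \tag{31}$$

For $L$ large enough, the right side above is positive and we therefore have

$$\begin{aligned}
1+\Delta_i &\ge (1+\Delta_{i-1}) \left[ 1 - \frac{2R/L}{1-2R/L} \left( \frac{|\gamma_i|}{(1+\Delta_{i-1})^2} + \frac{2|\epsilon_i|}{1+\Delta_{i-1}} \right) \right]^{1/2} \\
&\ge (1+\Delta_{i-1}) \left[ 1 - \frac{2R/L}{1-2R/L} \left( \frac{|\gamma_i|}{(1+\Delta_{i-1})^2} + \frac{2|\epsilon_i|}{1+\Delta_{i-1}} \right) \right] \\
&= 1 + \Delta_{i-1} - \frac{2R/L}{1-2R/L} \left( \frac{|\gamma_i|}{1+\Delta_{i-1}} + 2|\epsilon_i| \right)
\end{aligned} \tag{32}$$

where the second inequality holds since $\sqrt{1-x} \ge 1-x$ for $x \in (0, 1)$. This implies

$$\begin{aligned}
\Delta_i &\ge \Delta_{i-1} - \frac{2R/L}{1-2R/L} \left( \frac{|\gamma_i|}{1+\Delta_{i-1}} + 2|\epsilon_i| \right) \\
&\overset{(a)}{\ge} \Delta_{i-1} - \frac{2R/L}{1-2R/L} \left( 2|\gamma_i| + 2|\epsilon_i| \right) \\
&\overset{(b)}{\ge} \Delta_0 - \frac{4R/L}{1-2R/L} \sum_{j=1}^{i-1} \left( |\gamma_j| + |\epsilon_j| \right) - \frac{4R/L}{1-2R/L} \left( |\gamma_i| + |\epsilon_i| \right).
\end{aligned} \tag{33}$$

In the chain above, (a) holds because $\Delta_{i-1} > -\frac{1}{2}$, a consequence of the induction hypothesis as shown in (30). (b) is obtained by using the induction hypothesis for $\Delta_{i-1}$. The proof of the lemma is complete. □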



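Lemma 5.4. When $\mathcal{A}$ is true and $L$ is large enough that Lemma 5.3 holds,

$$|\Delta_i| \le |\Delta_0|\, w^{i} + \frac{4R/L}{1-2R/L} \sum_{j=1}^{i} w^{i-j} \left( |\gamma_j| + |\epsilon_j| \right), \quad i = 1, \ldots, L, \tag{34}$$

where $w = 1 + \frac{R/L}{1-2R/L}$.

Proof. We prove the lemma by induction. For $i = 1$, we have from (25)

$$\begin{aligned}
(1+\Delta_1)^2 &= (1+\Delta_0)^2 + \frac{2R/L}{1-2R/L} \left( \Delta_0^2 + \gamma_1 - 2\epsilon_1 (1+\Delta_0) \right) \\
&\le 1 + \Delta_0^2 + 2|\Delta_0| + \frac{2R/L}{1-2R/L} \left( \Delta_0^2 + |\gamma_1| + 2|\epsilon_1| (1+|\Delta_0|) \right) \\
&= (1+|\Delta_0|)^2 \left[ 1 + \frac{2R/L}{1-2R/L} \left( \frac{\Delta_0^2}{(1+|\Delta_0|)^2} + \frac{|\gamma_1|}{(1+|\Delta_0|)^2} + \frac{2|\epsilon_1|}{1+|\Delta_0|} \right) \right].
\end{aligned} \tag{35}$$

Therefore,

$$\begin{aligned}
1+\Delta_1 &\le (1+|\Delta_0|) \left[ 1 + \frac{2R/L}{1-2R/L} \left( \frac{\Delta_0^2}{(1+|\Delta_0|)^2} + \frac{|\gamma_1|}{(1+|\Delta_0|)^2} + \frac{2|\epsilon_1|}{1+|\Delta_0|} \right) \right]^{1/2} \\
&\le (1+|\Delta_0|) \left[ 1 + \frac{R/L}{1-2R/L} \left( \frac{\Delta_0^2}{(1+|\Delta_0|)^2} + \frac{|\gamma_1|}{(1+|\Delta_0|)^2} + \frac{2|\epsilon_1|}{1+|\Delta_0|} \right) \right]
\end{aligned} \tag{36}$$

where we have used the inequality $\sqrt{1+x} \le 1 + \frac{x}{2}$ for $x \ge 0$. We therefore have

$$\begin{aligned}
\Delta_1 &\le |\Delta_0| + \frac{R/L}{1-2R/L} \left( \frac{\Delta_0^2}{1+|\Delta_0|} + \frac{|\gamma_1|}{1+|\Delta_0|} + 2|\epsilon_1| \right) \\
&\overset{(a)}{\le} |\Delta_0| + \frac{R/L}{1-2R/L} \left( |\Delta_0| + |\gamma_1| + 2|\epsilon_1| \right) \\
&\le |\Delta_0| \left( 1 + \frac{R/L}{1-2R/L} \right) + \frac{2R/L}{1-2R/L} \left( |\gamma_1| + |\epsilon_1| \right).
\end{aligned} \tag{37}$$

In (37), we used $|\Delta_0|/(1+|\Delta_0|) < 1$ to obtain (a). From Lemma 5.3, we have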

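$$\Delta_1 \ge \Delta_0 - \frac{4R/L}{1-2R/L} \left( |\gamma_1| + |\epsilon_1| \right) \ge -|\Delta_0| - \frac{4R/L}{1-2R/L} \left( |\gamma_1| + |\epsilon_1| \right). \tag{38}$$

Combining (37) and (38), we obtain

$$|\Delta_1| \le |\Delta_0| \left( 1 + \frac{R/L}{1-2R/L} \right) + \frac{4R/L}{1-2R/L} \left( |\gamma_1| + |\epsilon_1| \right). \tag{39}$$

This completes the proof for $i = 1$. Towards induction, assume that the lemma holds for $i-1$. From (26), we obtain

$$(1+\Delta_i)^2 \le 1 + \Delta_{i-1}^2 + 2|\Delta_{i-1}| + \frac{2R/L}{1-2R/L} \left( \Delta_{i-1}^2 + |\gamma_i| + 2|\epsilon_i| (1+|\Delta_{i-1}|) \right). \tag{40}$$

Using arguments identical to those in (35)-(37), we get

$$\Delta_i \le |\Delta_{i-1}| \left( 1 + \frac{R/L}{1-2R/L} \right) + \frac{2R/L}{1-2R/L} \left( |\gamma_i| + |\epsilon_i| \right). \tag{41}$$

From the proof of Lemma 5.3 (see (33)), we have

$$\Delta_i \ge \Delta_{i-1} - \frac{4R/L}{1-2R/L} \left( |\gamma_i| + |\epsilon_i| \right) \ge -|\Delta_{i-1}| - \frac{4R/L}{1-2R/L} \left( |\gamma_i| + |\epsilon_i| \right). \tag{42}$$

Combining (41) and (42), we obtain

$$|\Delta_i| \le |\Delta_{i-1}| \left( 1 + \frac{R/L}{1-2R/L} \right) + \frac{4R/L}{1-2R/L} \left( |\gamma_i| + |\epsilon_i| \right). \tag{43}$$

Using the induction hypothesis to bound $|\Delta_{i-1}|$ in (43), we obtain

$$|\Delta_i| \le |\Delta_0|\, w^{i} + \frac{4R/L}{1-2R/L} \sum_{j=1}^{i} w^{i-j} \left( |\gamma_j| + |\epsilon_j| \right), \qquad w = 1 + \frac{R/L}{1-2R/L}.$$

□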
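The two-sided control given by Lemmas 5.3 and 5.4 can be checked numerically. The sketch below iterates the recursion (26) with synthetic deviations and tests the lower bound (29) and the upper bound of Lemma 5.4 at every step. It is only an illustration: the function name `check_lemma_bounds` is hypothetical, and the deviations $\gamma_i, \epsilon_i$ are drawn uniformly from a small interval (an assumption made for convenience, not their true distribution), chosen small enough that $\Delta_i > -1/2$ is guaranteed throughout.

```python
import math
import random


def check_lemma_bounds(R=1.0, L=200, delta0=0.05, spread=0.05, seed=1):
    """Iterate recursion (26) with synthetic deviations and test the bounds
    of Lemma 5.3 (lower) and Lemma 5.4 (upper) at every step."""
    rng = random.Random(seed)
    rho = 2.0 * R / L
    w = 1.0 + (R / L) / (1.0 - rho)          # growth factor in Lemma 5.4
    scale = (4.0 * R / L) / (1.0 - rho)      # common prefactor (4R/L)/(1-2R/L)
    delta = delta0
    dev_sum = 0.0          # running sum of |gamma_j| + |eps_j|
    upper = abs(delta0)    # running value of the Lemma 5.4 bound
    ok = True
    for _ in range(L):
        gamma = rng.uniform(-spread, spread)   # synthetic stand-in for gamma_i
        eps = rng.uniform(-spread, spread)     # synthetic stand-in for eps_i
        sq = (1.0 + delta) ** 2 + rho / (1.0 - rho) * (
            delta ** 2 + gamma - 2.0 * eps * (1.0 + delta))
        delta = math.sqrt(sq) - 1.0            # Delta_i from recursion (26)
        dev_sum += abs(gamma) + abs(eps)
        # Unrolling upper_i = w * upper_{i-1} + scale * (|gamma_i| + |eps_i|)
        # gives |Delta_0| w^i + scale * sum_j w^(i-j) (|gamma_j| + |eps_j|).
        upper = w * upper + scale * (abs(gamma) + abs(eps))
        ok = ok and (delta >= delta0 - scale * dev_sum - 1e-12)   # bound (29)
        ok = ok and (abs(delta) <= upper + 1e-12)                 # Lemma 5.4
    return ok, delta, upper


if __name__ == "__main__":
    print(check_lemma_bounds())
```

In such runs the upper bound is typically loose, since it accumulates absolute values of deviations that partially cancel in the actual recursion.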
Lemma 5.4 implies that when $\mathcal{A}$ holds and $L$ is sufficiently large,
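$$\begin{aligned}
|\Delta_L| &\le |\Delta_0|\, w^{L} + \frac{4R/L}{1-2R/L} \sum_{j=1}^{L} w^{L-j} \left( |\gamma_j| + |\epsilon_j| \right) \\
&\overset{(a)}{\le} w^{L} \left( \delta_0 + \frac{4R(\delta_1 + \delta_2)}{1-2R/L} \right) \\
&\overset{(b)}{\le} \exp\left( \frac{R}{1-2R/L} \right) \left( \delta_0 + \frac{4R(\delta_1 + \delta_2)}{1-2R/L} \right) \\
&\le e^{R} \left( \delta_0 + 5R(\delta_1 + \delta_2) \right) \quad \text{for large enough } L,
\end{aligned} \tag{44}$$

where $w = 1 + \frac{R/L}{1-2R/L}$. In the above, (a) is true because $\mathcal{A}$ holds and $w^{L-j} \le w^{L}$, and (b) is obtained by applying the inequality $1 + x \le e^{x}$ with $x = \frac{R/L}{1-2R/L}$.

Hence when $\mathcal{A}$ holds and $L$ is sufficiently large, the distortion can be bounded as

$$|R_L|^2 = \sigma^2 \left(1 - \frac{2R}{L}\right)^{L} (1+\Delta_L)^2 \le \sigma^2 e^{-2R} (1+|\Delta_L|)^2 \overset{(a)}{\le} \sigma^2 e^{-2R} (1 + e^{R}\Delta)^2 = \sigma^2 e^{-2R} + 2\sigma^2 e^{-R} \Delta + \sigma^2 \Delta^2 \tag{45}$$

where the first inequality uses $(1-2R/L)^{L} \le e^{-2R}$ and (a) follows from (44) by defining $\Delta = \delta_0 + 5R(\delta_1 + \delta_2)$. Combining (45) with (28) completes the proof of the theorem.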



6 Discussion 

We have studied a new ensemble of codes for lossy compression where the codewords are structured 
linear combinations of elements of a design matrix. The size of the design matrix is a low-order 
polynomial in the block length, as a result of which the storage complexity is much lower than that 
of the random i.i.d codebook. We proposed a successive-approximation encoder with computational complexity polynomial in the block length. For any ergodic source with known variance, the encoder was shown to attain the optimal distortion-rate function of an i.i.d Gaussian source with the same variance, with the probability of excess distortion decaying exponentially in n/log n.

The encoding algorithm can be interpreted as successively refining the source over an asymptotically large number of stages with asymptotically small rate in each stage. We emphasize that the successive refinement interpretation is unique to this particular algorithm, and is not an inherent property of the sparse regression codebook. We also note that the section coefficients were chosen to optimize the encoding algorithm. The coefficients allocate 'power' across sections of the design matrix and they are chosen depending on the encoder. For example, the optimal (minimum-distance) encoder analyzed in [10, 11] has equal-valued section coefficients.

For the proposed encoder, the gap from the distortion-rate function D*(R) as a function of design matrix dimension is O(log log M / log M), as given in (17). An important direction for future work is designing computationally efficient encoders for SPARCs with faster convergence to D*(R) with dimension (or block length). The results of [30, 31] show that the optimal gap from D*(R) (among all codes) is O(1/√n). The fact that SPARCs achieve the optimal error exponent with minimum-distance encoding [11] suggests that it is possible to design encoders with faster convergence to D*(R) at the expense of slightly higher computational complexity. One such idea is to make the encoder less 'greedy', i.e., search across multiple sections instead of sequentially picking one column at a time. Ideas from sparse signal recovery such as ℓ1 minimization [33-35] may also prove useful. Another approach to improve the high-rate distortion performance is to construct a few sections of the design matrix in a structured way so as to optimize the shapes of the Voronoi cells.

Another direction for further investigation is exploring design matrices with smaller storage 
complexity. For example, a SPARC defined by a design matrix with i.i.d ±1 entries was found 
to have empirical distortion-rate performance very similar to the Gaussian design matrix. Since 
binary entries imply a much reduced storage requirement compared to Gaussian entries, establishing 
theoretical performance bounds for the ±1 design matrix is an interesting open problem. 

The results of this paper together with those in [5, 6] show that SPARCs with computationally efficient encoders and decoders can be used for compression as well as communication, at rates approaching the Shannon-theoretic limits. Further, it was shown in [17] that the source and channel
coding SPARCs can be nested to effect binning and superposition, which are essential ingredients 
of coding schemes for multi-terminal source and channel coding problems. Sparse regression codes 
therefore offer a promising framework to develop fast, rate-optimal codes for a variety of models in 
network information theory. 

APPENDIX 



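A Proof of Lemma 5.1

Recall from (22) that

$$\gamma_i = |A_{m_i}|^2 - 1, \quad i = 1, \ldots, L.$$

We have

$$P\left( \frac{1}{L}\sum_{i=1}^{L} |\gamma_i| > \delta \right) \le \sum_{i=1}^{L} P\left( |\gamma_i| > \delta \right) = \sum_{i=1}^{L} P\left( \{ |A_{m_i}|^2 > 1+\delta \} \cup \{ |A_{m_i}|^2 < 1-\delta \} \right). \tag{46}$$

The right side above can be bounded as

$$\begin{aligned}
\sum_{i=1}^{L} P\left( \{ |A_{m_i}|^2 > 1+\delta \} \cup \{ |A_{m_i}|^2 < 1-\delta \} \right)
&\overset{(a)}{\le} \sum_{i=1}^{L} P\left( \bigcup_{j=(i-1)M+1}^{iM} \{ |A_j|^2 > 1+\delta \} \cup \{ |A_j|^2 < 1-\delta \} \right) \\
&\overset{(b)}{\le} \sum_{i=1}^{L} \sum_{j=(i-1)M+1}^{iM} \left[ P(|A_j|^2 > 1+\delta) + P(|A_j|^2 < 1-\delta) \right] \\
&= ML \left[ P(|A_j|^2 > 1+\delta) + P(|A_j|^2 < 1-\delta) \right].
\end{aligned} \tag{47}$$

(a) follows from the observation that $m_i \in \{(i-1)M+1, \ldots, iM\}$, i.e., $A_{m_i}$ is one of the columns in Section $i$ of $A$. (b) is due to the union bound.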

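Using a Chernoff bound for $P(|A_j|^2 > 1+\delta)$, we have

$$\begin{aligned}
P(|A_j|^2 > 1+\delta) &= P(\|A_j\|^2 > n(1+\delta)) \\
&\le \exp(-tn(1+\delta))\, E[\exp(t\|A_j\|^2)], \quad t > 0 \\
&= \exp(-tn(1+\delta))\, (1-2t)^{-n/2}, \quad 0 < t < \tfrac{1}{2}.
\end{aligned} \tag{48}$$

The last line is obtained by using the moment generating function of $\|A_j\|^2$, a $\chi^2_n$ random variable. Using $t = \frac{\delta}{2(1+\delta)}$, we get

$$P(|A_j|^2 > 1+\delta) \le \exp\left( -\frac{n\delta}{2} \right) (1+\delta)^{n/2} \le \exp\left( -\frac{n\delta^2}{8} \right) \tag{49}$$

where the second inequality above is obtained using the bound $\ln(1+\delta) \le \delta - \frac{\delta^2}{4}$ for $\delta \in [0, 1]$.

Similarly,

$$\begin{aligned}
P(|A_j|^2 < 1-\delta) &= P(\|A_j\|^2 < n(1-\delta)) \\
&\le \exp(tn(1-\delta))\, E[\exp(-t\|A_j\|^2)], \quad t > 0 \\
&= \exp(tn(1-\delta))\, (1+2t)^{-n/2}.
\end{aligned} \tag{50}$$

Using $t = \frac{\delta}{2(1-\delta)}$, we get

$$P(|A_j|^2 < 1-\delta) \le \exp\left( \frac{n\delta}{2} \right) (1-\delta)^{n/2} \le \exp\left( -\frac{n\delta^2}{4} \right) \tag{51}$$

where we have used $\log(1-\delta) \le -\delta - \frac{\delta^2}{2}$ for $\delta \in [0, 1)$ (for $\delta = 1$ the probability on the left is zero). Substituting (49) and (51) in (47) completes the proof.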
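The two elementary logarithm bounds used in (49) and (51) can be checked on a grid. The following sketch computes the exponents of the optimized Chernoff bounds and the right side of Lemma 5.1; the function names are hypothetical and the grid check is a numerical illustration, not a proof.

```python
import math


def upper_tail_exponent(delta):
    """Per-sample exponent of the optimized Chernoff bound in (49):
    P(|A_j|^2 > 1 + delta) <= exp(-n * upper_tail_exponent(delta))."""
    return 0.5 * (delta - math.log1p(delta))


def lower_tail_exponent(delta):
    """Per-sample exponent of the optimized Chernoff bound in (51),
    valid for 0 <= delta < 1."""
    return -0.5 * (delta + math.log1p(-delta))


def lemma_5_1_bound(n, M, L, delta):
    """Right side of Lemma 5.1: 2 M L exp(-n delta^2 / 8)."""
    return 2.0 * M * L * math.exp(-n * delta ** 2 / 8.0)


if __name__ == "__main__":
    grid = [k / 100.0 for k in range(1, 100)]
    # Each exponent should dominate the simplified one used in the proof.
    print(all(upper_tail_exponent(d) >= d * d / 8.0 for d in grid))
    print(all(lower_tail_exponent(d) >= d * d / 4.0 for d in grid))
    print(lemma_5_1_bound(n=1000, M=8, L=10, delta=0.5))
```

The lower-tail exponent is at least twice the simplified upper-tail exponent, which is why the sum of the two tail probabilities is bounded by $2\exp(-n\delta^2/8)$.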
B Proof of Lemma 5.2



For a random variable $X$, let $f_X$ and $F_X$ denote the density and distribution functions, respectively. Recall from (10) that for $i \in \{1, \ldots, L\}$ and $j \in \{(i-1)M+1, \ldots, iM\}$, the statistic $T_j^{(i)}$ is

$$T_j^{(i)} = \left\langle A_j, \frac{R_{i-1}}{\|R_{i-1}\|} \right\rangle. \tag{52}$$

Define for $i = 1, \ldots, L$,

$$Z_i = \max_{(i-1)M+1 \le j \le iM} T_j^{(i)} = \sqrt{2\log M}\,(1+\epsilon_i). \tag{53}$$

We first show that the $Z_i$'s in (53) are i.i.d and thus

$$\epsilon_i = \frac{Z_i}{\sqrt{2\log M}} - 1, \quad i = 1, \ldots, L \tag{54}$$

are i.i.d random variables. For brevity, we denote the collection $\{T_j^{(i)},\ j = (i-1)M+1, \ldots, iM\}$ by $T^{(i)}$ for $i = 1, \ldots, L$. Consider the conditional joint distribution function $F_{T^{(i)} \mid T^{(i-1)}, \ldots, T^{(1)}, R_0}$. We have

$$F_{T^{(i)} \mid T^{(i-1)}, \ldots, T^{(1)}, R_0} \overset{(a)}{=} F_{T^{(i)} \mid R_{i-1}} \overset{(b)}{=} \prod_{j=(i-1)M+1}^{iM} F_{T_j^{(i)}}. \tag{55}$$

(a) is obtained by using the following observation in (52): each column $A_j$ in the $i$th section of $A$ is independent of $\{T^{(i-1)}, \ldots, T^{(1)}, R_0 = S\}$ since the latter are functions of the source sequence and the columns in the first $i-1$ sections of $A$. (b) follows from the discussion in Section 3.2: recall that for each fixed $i$, conditioned on $R_{i-1} = r$, the statistics $\{T_j^{(i)}\}$, $(i-1)M+1 \le j \le iM$, are i.i.d $\mathcal{N}(0,1)$ random variables. Therefore

$$F_{R_0, T^{(1)}, \ldots, T^{(L)}} = F_{R_0} \prod_{i=1}^{L} \prod_{j=(i-1)M+1}^{iM} F_{T_j^{(i)}}$$

where $\{T_j^{(i)}\} \sim$ i.i.d $\mathcal{N}(0,1)$ for all $i, j$. Consequently, $Z_i$, $i = 1, \ldots, L$, are i.i.d random variables.
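Using a Chernoff bound, we have

$$P\left( \frac{1}{L}\sum_{i=1}^{L} |\epsilon_i| > \delta \right) \le \left( E[\exp(t|\epsilon_1|)]\, \exp(-t\delta) \right)^{L}, \quad \forall t > 0. \tag{56}$$

We choose $t = 2\log M$ and compute the bound. We have

$$E[\exp(t|\epsilon_1|)] = \int_{0}^{\infty} \exp(tx) f_{\epsilon_1}(x)\,dx + \int_{-\infty}^{0} \exp(-tx) f_{\epsilon_1}(x)\,dx. \tag{57}$$

The first integral can be bounded as follows.

$$\begin{aligned}
\int_{0}^{\infty} \exp(tx) f_{\epsilon_1}(x)\,dx &\le \int_{-\infty}^{\infty} \exp(tx) f_{\epsilon_1}(x)\,dx \\
&= E[\exp(t\epsilon_1)] \\
&= \frac{1}{M^2}\, E\left[ \exp\left( \sqrt{2\log M}\, Z_1 \right) \right]
\end{aligned} \tag{58}$$

where the second equality is obtained from (54) and $t = 2\log M$. Since $Z_1$ is the maximum of the i.i.d $\mathcal{N}(0,1)$ random variables $T_j^{(1)}$, $1 \le j \le M$, we have

$$\begin{aligned}
E\left[ \exp\left( \sqrt{2\log M}\, Z_1 \right) \right] &= E\left[ \max_{1 \le j \le M} \exp\left( \sqrt{2\log M}\, T_j^{(1)} \right) \right] \\
&\le E\left[ \sum_{j=1}^{M} \exp\left( \sqrt{2\log M}\, T_j^{(1)} \right) \right] \\
&= M\, E\left[ \exp\left( \sqrt{2\log M}\, T_1^{(1)} \right) \right] \\
&\overset{(a)}{=} M \exp(\log M) = M^2
\end{aligned} \tag{59}$$

where (a) is obtained by evaluating the moment-generating function of a $\mathcal{N}(0,1)$ random variable at $\sqrt{2\log M}$. Using (59) in (58), we obtain

$$\int_{0}^{\infty} \exp(tx) f_{\epsilon_1}(x)\,dx \le 1. \tag{60}$$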
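The bound (59) replaces the maximum of $M$ exponentials by their sum, so it can be compared against a direct numerical evaluation of $E[\exp(\sqrt{2\log M}\, Z_1)]$. The sketch below integrates against the density of the maximum of $M$ i.i.d standard Gaussians, $M\phi(z)(\Phi(z))^{M-1}$, with a trapezoidal rule. The function names, the integration range, and the step count are assumptions chosen for illustration.

```python
import math


def std_normal_pdf(z):
    """Standard Gaussian density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Standard Gaussian distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mgf_of_max(M, lo=-12.0, hi=14.0, steps=52000):
    """Trapezoidal estimate of E[exp(s * Z_1)] with s = sqrt(2 log M), where
    Z_1 is the maximum of M i.i.d standard Gaussian random variables."""
    s = math.sqrt(2.0 * math.log(M))
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        density = M * std_normal_pdf(z) * std_normal_cdf(z) ** (M - 1)
        weight = 0.5 if k in (0, steps) else 1.0   # trapezoid end-point weights
        total += weight * math.exp(s * z) * density
    return total * h


if __name__ == "__main__":
    for M in (64, 1000):
        # The ratio estimates the right side of (58) and should not exceed 1.
        print(M, mgf_of_max(M) / M ** 2)
```

The printed ratio is an estimate of $\frac{1}{M^2} E[\exp(\sqrt{2\log M}\, Z_1)]$, the quantity that (60) bounds by 1.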



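The second integral in (57) can be written as

$$\int_{-\infty}^{0} \exp(-tx) f_{\epsilon_1}(x)\,dx = \int_{-\infty}^{0} \exp(-tx)\, f_{Z_1}\left( \sqrt{2\log M}\,(x+1) \right) \sqrt{2\log M}\,dx \tag{61}$$

where we have used (54) to express $f_{\epsilon_1}$ in terms of $f_{Z_1}$. Using the change of variable $z = \sqrt{2\log M}\,(x+1)$, we have

$$\int_{-\infty}^{0} \exp(-tx) f_{\epsilon_1}(x)\,dx = \int_{-\infty}^{\sqrt{2\log M}} \exp\left( -t\left( \frac{z}{\sqrt{2\log M}} - 1 \right) \right) f_{Z_1}(z)\,dz = I_1 + I_2 + I_3 \tag{62}$$

where

$$I_1 = \int_{-\infty}^{0} \exp\left( -t\left( \frac{z}{\sqrt{2\log M}} - 1 \right) \right) f_{Z_1}(z)\,dz, \tag{63}$$

$$I_2 = \int_{0}^{\sqrt{2\log M}-1} \exp\left( -t\left( \frac{z}{\sqrt{2\log M}} - 1 \right) \right) f_{Z_1}(z)\,dz, \tag{64}$$

$$I_3 = \int_{\sqrt{2\log M}-1}^{\sqrt{2\log M}} \exp\left( -t\left( \frac{z}{\sqrt{2\log M}} - 1 \right) \right) f_{Z_1}(z)\,dz. \tag{65}$$

We evaluate each of these integrals below. Since $Z_1$ is the maximum of $M$ i.i.d standard Gaussians, its distribution function and density are given by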



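$$F_{Z_1}(z) = (\Phi(z))^{M}, \qquad f_{Z_1}(z) = M \phi(z) (\Phi(z))^{M-1}$$

where $\Phi$ and $\phi$ denote the standard Gaussian distribution function and density, respectively. $I_1$ can then be written as

$$\begin{aligned}
I_1 &= \int_{-\infty}^{0} \exp\left( -t \frac{z}{\sqrt{2\log M}} \right) \exp(t)\, M (\Phi(z))^{M-1} \phi(z)\,dz \\
&\overset{(a)}{\le} \frac{M}{2^{M-1}} \int_{-\infty}^{0} \exp\left( -t \frac{z}{\sqrt{2\log M}} \right) \exp(t)\, \phi(z)\,dz \\
&\overset{(b)}{=} \frac{M^3}{2^{M-1}} \int_{-\infty}^{0} \exp\left( -\sqrt{2\log M}\, z \right) \phi(z)\,dz \\
&= \frac{M^3}{2^{M-1}} \int_{0}^{\infty} \exp\left( \sqrt{2\log M}\, z \right) \phi(z)\,dz \\
&\le \frac{M^3}{2^{M-1}} \int_{-\infty}^{\infty} \exp\left( \sqrt{2\log M}\, z \right) \phi(z)\,dz \\
&\overset{(c)}{=} \frac{M^4}{2^{M-1}}.
\end{aligned} \tag{66}$$

In the above, (a) is true because $\Phi(z) < \frac{1}{2}$ for $z < 0$, (b) is obtained by substituting $t = 2\log M$, and (c) is obtained by evaluating the moment generating function of a standard Gaussian at $\sqrt{2\log M}$.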
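Next,

$$\begin{aligned}
I_2 &= \int_{0}^{\sqrt{2\log M}-1} \exp\left( -t\left( \frac{z}{\sqrt{2\log M}} - 1 \right) \right) M (\Phi(z))^{M-1} \phi(z)\,dz \\
&\le M^3 \max_{z \in [0, \sqrt{2\log M}-1]} \left( \exp\left( -\sqrt{2\log M}\, z \right) (\Phi(z))^{M-1} \right) \int_{0}^{\sqrt{2\log M}-1} \phi(x)\,dx.
\end{aligned} \tag{67}$$

Let

$$g(z) = \exp\left( -\sqrt{2\log M}\, z \right) (\Phi(z))^{M-1}.$$

It can be verified that $g(z)$ is an increasing function in $z \in [0, \sqrt{2\log M}-1]$ for large enough $M$ ($M > 6$ is sufficient). Therefore the maximum is attained at $\sqrt{2\log M}-1$ and (67) becomes

$$I_2 < M^3 g\left( \sqrt{2\log M}-1 \right) = M \exp\left( \sqrt{2\log M} \right) \left( \Phi\left( \sqrt{2\log M}-1 \right) \right)^{M-1}. \tag{68}$$

Claim 1: $I_2 \to 0$ as $M \to \infty$.

Proof. Using the bound

$$\Phi(x) < 1 - \frac{x}{1+x^2} \cdot \frac{e^{-x^2/2}}{\sqrt{2\pi}}, \quad x > 0, \tag{69}$$

we have

$$\begin{aligned}
\left( \Phi\left( \sqrt{2\log M}-1 \right) \right)^{M-1}
&< \left( 1 - \frac{\sqrt{2\log M}-1}{1+(\sqrt{2\log M}-1)^2} \cdot \frac{\exp(\sqrt{2\log M})}{M\sqrt{2\pi e}} \right)^{M-1} \\
&= \left( 1 - \left( \frac{(\sqrt{2\log M}-1)^2}{1+(\sqrt{2\log M}-1)^2} \right) \left( \frac{(M-1)\sqrt{2\log M}}{M(\sqrt{2\log M}-1)} \right) \frac{\exp(\sqrt{2\log M})}{(M-1)\sqrt{2\log M}\,\sqrt{2\pi e}} \right)^{M-1} \\
&\overset{(a)}{\le} \left( 1 - \frac{1}{M-1} \cdot \frac{\exp(\sqrt{2\log M})}{\sqrt{2\log M}\,\sqrt{4\pi e}} \right)^{M-1} \\
&\overset{(b)}{\le} \exp\left( -\frac{\exp(\sqrt{2\log M})}{\sqrt{2\log M}\,\sqrt{4\pi e}} \right).
\end{aligned} \tag{70}$$

In the above, (a) holds for large enough $M$, and (b) is obtained using $1 + x \le e^{x}$. Using this bound in (68) yields

$$I_2 < \exp\left( -\frac{\exp(\sqrt{2\log M})}{\sqrt{2\log M}\,\sqrt{4\pi e}} + \log M + \sqrt{2\log M} \right). \tag{71}$$

Since the first term of the exponent dominates as $M$ grows large, the claim is proved. □

Finally we bound $I_3$ as follows.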



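$$\begin{aligned}
I_3 &= \int_{\sqrt{2\log M}-1}^{\sqrt{2\log M}} \exp\left( -t\left( \frac{z}{\sqrt{2\log M}} - 1 \right) \right) M (\Phi(z))^{M-1} \phi(z)\,dz \\
&\overset{(a)}{=} \int_{0}^{1} \exp\left( \sqrt{2\log M}\, u \right) M \left( \Phi\left( \sqrt{2\log M} - u \right) \right)^{M-1} \phi\left( \sqrt{2\log M} - u \right) du \\
&= \int_{0}^{1} \exp\left( \sqrt{2\log M}\, u \right) \left( \Phi\left( \sqrt{2\log M} - u \right) \right)^{M-1} \frac{\exp\left( \sqrt{2\log M}\, u \right) \exp(-u^2/2)}{\sqrt{2\pi}}\,du \\
&\overset{(b)}{\le} \max_{u \in [0,1]} \frac{1}{\sqrt{2\pi}} \exp\left( 2\sqrt{2\log M}\, u \right) \left( \Phi\left( \sqrt{2\log M} - u \right) \right)^{M-1}.
\end{aligned} \tag{72}$$

In (72), (a) is obtained using the change of variable $z = \sqrt{2\log M} - u$ and (b) by bounding $\exp(-u^2/2)$ by 1.

Using the upper bound (69) for $\Phi$ and steps similar to (70), we obtain for $u \in [0, 1]$

$$\frac{1}{\sqrt{2\pi}} \exp\left( 2\sqrt{2\log M}\, u \right) \left( \Phi\left( \sqrt{2\log M} - u \right) \right)^{M-1} \le \frac{1}{\sqrt{2\pi}} \exp\left( 2u\sqrt{2\log M} \right) \exp\left( -\frac{(1-\delta_M)}{\sqrt{4\pi e \log M}} \exp\left( \sqrt{2\log M}\, u \right) \right) \tag{73}$$

where $\delta_M > 0$ is a generic constant that goes to 0 as $M \to \infty$. It can be verified that the maximum of the right side of (73) is attained at

$$u = \frac{\log(16\pi e \log M)}{2\sqrt{2\log M}}\,(1+\delta_M)$$

and from (72), the maximum value is an upper bound for $I_3$:

$$I_3 \le \frac{8\sqrt{2\pi}\,(1+\delta_M)}{e} \log M. \tag{74}$$

Using (60), (66), Claim 1 and (74) in (57), we conclude that

$$E[\exp(2\log M\, |\epsilon_1|)] \le 1 + \frac{8\sqrt{2\pi}\,(1+\delta_M)}{e} \log M < 8\log M \tag{75}$$

for sufficiently large $M$. ($\delta_M$ is a generic constant that goes to 0 as $M \to \infty$.)

Using this in (56), we obtain

$$P\left( \frac{1}{L}\sum_{i=1}^{L} |\epsilon_i| > \delta \right) < \left( 8\log M \cdot \exp(-2\delta \log M) \right)^{L} = \left( \frac{M^{2\delta}}{8\log M} \right)^{-L}. \tag{76}$$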
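Combining (56) with $t = 2\log M$ and (75) gives a bound of the form $(8\log M \cdot M^{-2\delta})^{L}$, which is informative only when $M^{2\delta} > 8\log M$. The sketch below evaluates this expression and the threshold on $\delta$; the function names are hypothetical, and the bound itself applies only for sufficiently large $M$, as required by (75).

```python
import math


def lemma_5_2_bound(M, L, delta):
    """Evaluate (8 log M * M^(-2 delta))^L, computed in the log domain to
    avoid overflow for large M or L."""
    log_ratio = 2.0 * delta * math.log(M) - math.log(8.0 * math.log(M))
    return math.exp(-L * log_ratio)


def smallest_useful_delta(M):
    """Value of delta at which M^(2 delta) equals 8 log M; above it the
    bound decays exponentially in L."""
    return math.log(8.0 * math.log(M)) / (2.0 * math.log(M))


if __name__ == "__main__":
    M = 2 ** 20
    print(smallest_useful_delta(M))
    print(lemma_5_2_bound(M, L=50, delta=0.5))
```

The threshold decreases slowly with $M$, which reflects the $\log\log M / \log M$ behaviour of the gap to D*(R) noted in the Discussion.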